Co-Scheduling Algorithms for High-Throughput
Workload Execution

Guillaume Aupy, Manu Shantharam, Anne Benoit, Yves Robert,
Padma Raghavan

{guillaume.aupy, anne.benoit, yves.robert}@ens-lyon.fr;
shantharam.manu@gmail.com; raghavan@cse.psu.edu

May 1, 2013




Abstract 

This paper investigates co-scheduling algorithms for processing a set of parallel applications. Instead of executing
each application one by one, using a maximum degree of parallelism for each of them, we aim at scheduling several
applications concurrently. We partition the original application set into a series of packs, which are executed one by
one. A pack comprises several applications, each of them with an assigned number of processors, with the constraint
that the total number of processors assigned within a pack does not exceed the maximum number of available
processors. The objective is to determine a partition into packs, and an assignment of processors to applications, that
minimize the sum of the execution times of the packs. We thoroughly study the complexity of this optimization
problem, and propose several heuristics that exhibit very good performance on a variety of workloads, whose application
execution times model profiles of parallel scientific codes. We show that co-scheduling leads to faster workload
completion time and to faster response times on average (hence increasing system throughput and saving energy), for
significant benefits over traditional scheduling from both the user and system perspectives.

1 Introduction

The execution time of many high-performance computing applications can be significantly reduced when using a large
number of processors. Indeed, parallel multicore platforms enable the fast processing of very large size jobs, thereby
rendering the solution of challenging scientific problems more tractable. However, monopolizing all computing
resources to accelerate the processing of a single application is very likely to lead to inefficient resource usage. This is
because the typical speed-up profile of most applications is sub-linear and even reaches a threshold: when the number
of processors increases, the execution time first decreases, but not linearly, because it suffers from the overhead due to
communications and load imbalance; at some point, adding more resources does not lead to any significant benefit.

In this paper, we consider a pool of several applications that have been submitted for execution. Rather than
executing each of them in sequence, with the maximum number of available resources, we introduce co-scheduling
algorithms that execute several applications concurrently. We do increase the individual execution time of each
application, but (i) we improve the efficiency of the parallelization, because each application is scheduled on fewer
resources; (ii) the total execution time will be much shorter; and (iii) the average response time will also be shorter. In
other words, co-scheduling increases platform yield (thereby saving energy) without sacrificing response time.

In operating high performance computing systems, the costs of energy consumption can greatly impact the total 
costs of ownership. Consequently, there is a move away from a focus on peak performance (or speed) and towards 
improving energy efficiency [12, 20]. Recent results on improving the energy efficiency of workloads can be broadly
classified into approaches that focus on dynamic voltage and frequency scaling, or alternatively, task aggregation or 
co-scheduling. In both types of approaches, the individual execution time of an application may increase but there can 
be considerable energy savings in processing a workload. 

More formally, we deal with the following problem: given (i) a distributed-memory platform with p processors, 
and (ii) $n$ applications, or tasks, $T_i$, with their execution profiles ($t_{i,j}$ is the execution time of $T_i$ with $j$ processors),
what is the best way to co-schedule them, i.e., to partition them into packs, so as to minimize the sum of the execution
times over all packs? Here a pack is a subset of tasks, together with a processor assignment for each task. The constraint
is that the total number of resources assigned to the pack does not exceed p, and the execution time of the pack is the 
longest execution time of a task within that pack. The objective of this paper is to study this co-scheduling problem, 
both theoretically and experimentally. We aim at demonstrating the gain that can be achieved through co-scheduling, 
both on platform yield and response time, using a set of real-life application profiles. 

On the theoretical side, to the best of our knowledge, the complexity of the co-scheduling problem has never been 
investigated, except for the simple case when one enforces that each pack comprises at most $k = 2$ tasks [21]. While
the problem has polynomial complexity for the latter restriction (with at most $k = 2$ tasks per pack), we show that
it is NP-complete when assuming at most $k \ge 3$ tasks per pack. Note that the instance with $k = p$ is the general,
unconstrained, instance of the co-scheduling problem. We also propose an approximation algorithm for the general 
instance. In addition, we propose an optimal processor assignment procedure when the tasks that form a pack are given. 
We use these two results to derive efficient heuristics. Finally, we discuss how to optimally solve small-size instances, 
either through enumerating partitions, or through an integer linear program: this has a potentially exponential cost, but 
allows us to assess the absolute quality of the heuristics that we have designed. Altogether, all these results lay solid 
theoretical foundations for the problem. 

On the experimental side, we study the performance of the heuristics on a variety of workloads, whose application 
execution times model profiles of parallel scientific codes. We focus on three criteria: (i) cost of the co-schedule, i.e., 
total execution time; (ii) packing ratio, which evaluates the idle time of processors during execution; and (iii) response 
time compared to a fully parallel execution of each task, starting from the shortest task. The proposed heuristics show very
good performance within a short running time, hence validating the approach. 

The paper is organized as follows. We discuss related work in Section 2. The problem is then formally defined in
Section 3. Theoretical results are presented in Section 4, exhibiting the problem complexity, discussing sub-problems
and optimal solutions, and providing an approximation algorithm. Building upon these results, several
polynomial-time heuristics are described in Section 5, and they are thoroughly evaluated in Section 6. Finally we conclude and
discuss future work in Section 7.

2 Related work 

In this paper, we deal with pack scheduling for parallel tasks, aiming at makespan minimization (recall that the 
makespan is the total execution time). The corresponding problem with sequential tasks (tasks that execute on a 
single processor) is easy to solve for the makespan minimization objective: simply make a pack out of the largest p 
tasks, and proceed likewise while there remain tasks. Note that the pack scheduling problem with sequential tasks has 
been widely studied for other objective functions, see Brucker et al. [4] for various job cost functions, and Potts and
Kovalyov [18] for a survey. Back to the problem with sequential tasks and the makespan objective, Koole and Righter
[13] deal with the case where the execution time of each task is unknown but defined by a probabilistic distribution.
They showed counter-intuitive properties, that enabled them to derive an algorithm that computes the optimal policy
when there are two processors, improving the result of Deb and Serfozo [7], who considered the stochastic problem
with identical jobs. 

To the best of our knowledge, the problem with parallel tasks has not been studied as such. However, it was 
introduced by Dutot et al. in [8] as a moldable-by-phase model to approximate the moldable problem. The moldable
task model is similar to the pack-scheduling model, but one does not have the additional constraint (pack constraint)
that the execution of new tasks cannot start before all tasks in the current pack are completed. Dutot et al. in [8]
provide an optimal polynomial-time solution for the problem of pack scheduling identical independent tasks, using a
dynamic programming algorithm. This is the only instance of pack-scheduling with parallel tasks that we found in the
literature.

A closely related problem is the rectangle packing problem, or 2D-Strip-packing. Given a set of rectangles of 
different sizes, the problem consists in packing these rectangles into another rectangle of size p x m. If one sees 
one dimension (p) as the number of processors, and the other dimension (m) as the maximum makespan allowed, 
this problem is identical to the variant of our problem where the number of processors is pre-assigned to each task: 
each rectangle $r_i$ of size $p_i \times m_i$ that has to be packed can be seen as the task $T_i$ to be computed on $p_i$ processors,
with $t_{i,p_i} = m_i$. In [22], Turek et al. approximated the rectangle packing problem using shelf-based solutions: the
rectangles are assigned to shelves, whose placements correspond to constant time values. All rectangles assigned to a
shelf have equal starting times, and the next shelf is placed on top of the previous shelf. This is exactly what we ask
in our pack-scheduling model. This problem is also called level packing in some papers, and we refer the reader to a
recent survey on 2D-packing algorithms by Lodi et al. [16]. In particular, Coffman et al. in [6] show that level-packing
algorithms can reach a 2.7-approximation for the 2D-Strip-packing problem (1.7 when the length of each rectangle is
bounded by 1). Unfortunately, all these algorithms consider the number of processors (or width of the rectangles) to 
be already fixed for each task, hence they cannot be used directly in our problem for which a key decision is to decide 
the number of processors assigned to each task. 

In practice, pack scheduling is really useful as shown by recent results. Li et al. [15] propose a framework to predict
the energy and performance impacts of power-aware MPI task aggregation. Frachtenberg et al. [9] show that system
utilization can be improved through their schemes to co-schedule jobs based on their load-balancing requirements
and inter-processor communication patterns. In our earlier work [21], we had shown that even when the pack-size
is limited to 2, co-scheduling based on speed-up profiles can lead to faster workload completion and corresponding 
savings in system energy. 

Several recent publications [2, 5, 11] consider co-scheduling at a single multicore node, when contention for
resources by co-scheduled tasks leads to complex tradeoffs between energy and performance measures. Chandra et
al. [5] predict and utilize inter-thread cache contention at a multicore in order to improve performance. Hankendi
and Coskun [11] show that there can be measurable gains in energy per unit of work through the application of their
multi-level co-scheduling technique at runtime, which is based on classifying tasks according to specific performance
measures. Bhadauria and McKee [2] consider local search heuristics to co-schedule tasks in a resource-aware manner
at a multicore node to achieve significant gains in thread throughput per watt.

These publications demonstrate that complex tradeoffs cannot be captured through the use of the speed-up measure 
alone, without significant additional measurements to capture performance variations from cross-application
interference at a multicore node. Additionally, as shown in our earlier work [21], we expect significant benefits even when we
aggregate only across multicore nodes, because speed-ups suffer from the longer latencies of data transfer across
nodes. We can therefore project savings in energy as being commensurate with the savings in the time to complete a 
workload through co-scheduling. Hence, we only test configurations where no more than a single application can be 
scheduled on a multicore node. 

3 Problem definition 

The application consists of $n$ independent tasks $T_1, \dots, T_n$. The target execution platform consists of $p$ identical
processors, and each task $T_i$ can be assigned an arbitrary number $\sigma(i)$ of processors, where $1 \le \sigma(i) \le p$. The
objective is to minimize the total execution time by co-scheduling several tasks onto the $p$ resources. Note that the
approach is agnostic of the granularity of each processor, which can be either a single CPU or a multicore node.

Speedup profiles - Let $t_{i,j}$ be the execution time of task $T_i$ with $j$ processors, and $work(i,j) = j \times t_{i,j}$ be the
corresponding work. We assume the following for $1 \le i \le n$ and $1 \le j < p$:

Non-increasing execution time: $t_{i,j+1} \le t_{i,j}$ (1)

Non-decreasing work: $work(i,j) \le work(i,j+1)$ (2)

Equation (1) implies that execution time is a non-increasing function of the number of processors. Equation (2) states
that efficiency decreases with the number of enrolled processors: in other words, parallelization has a cost! As a
side note, we observe that these requirements make good sense in practice: many scientific tasks $T_i$ are such that $t_{i,j}$
first decreases (due to load-balancing) and then increases (due to communication overhead), reaching a minimum for
$j = j_0$; we can always let $t_{i,j} = t_{i,j_0}$ for $j \ge j_0$ by never actually using more than $j_0$ processors for $T_i$.
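
The clipping argument above can be illustrated with a short Python sketch. The function names and the sample profile are hypothetical and not part of the paper; the convention `times[j - 1]` for the execution time on $j$ processors is an assumption of this sketch.

```python
from itertools import accumulate


def enforce_non_increasing(times):
    """Clip a measured profile so that it satisfies Equation (1).

    times[j - 1] is the measured execution time on j processors; each entry
    is replaced by the best time observed with at most j processors.
    """
    return list(accumulate(times, min))


def satisfies_assumptions(times):
    """Check Equations (1) and (2) for one task profile."""
    work = [j * tj for j, tj in enumerate(times, start=1)]
    time_ok = all(a >= b for a, b in zip(times, times[1:]))
    work_ok = all(a <= b for a, b in zip(work, work[1:]))
    return time_ok and work_ok


# Hypothetical measurements: the minimum is reached for j0 = 3 processors.
measured = [10.0, 6.0, 5.0, 5.5, 7.0]
clipped = enforce_non_increasing(measured)
```

Clipping only guarantees Equation (1); whether Equation (2) also holds depends on the measured profile, which is why the sketch checks both conditions separately.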

Co-schedules - A co-schedule partitions the n tasks into groups (called packs), so that (i) all tasks from a given 
pack start their execution at the same time; and (ii) two tasks from different packs have disjoint execution intervals. 
See Figure 1 for an example. The execution time, or cost, of a pack is the maximal execution time of a task in that
pack, and the cost of a co-schedule is the sum of the costs of each pack. 

$k$-IN-$p$-COSCHEDULE optimization problem - Given a fixed constant $k \le p$, find a co-schedule with at most $k$
tasks per pack that minimizes the execution time. The most general problem is when $k = p$, but in some frameworks
we may have an upper bound $k < p$ on the maximum number of tasks within each pack.

Figure 1: A co-schedule with four packs $P_1$ to $P_4$.
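
As a minimal sketch of these definitions, the following Python code evaluates the cost of a given co-schedule and checks the pack constraints. The data layout (`t[i][j - 1]` for the execution time of task $i$ on $j$ processors, one dictionary per pack) and the sample profiles are assumptions made for illustration only.

```python
def pack_cost(pack, t):
    """Cost of one pack: the longest execution time among its tasks.

    pack maps a task index i to its processor count sigma(i);
    t[i][j - 1] is the execution time of task i on j processors.
    """
    return max(t[i][procs - 1] for i, procs in pack.items())


def coschedule_cost(packs, t, p, k):
    """Sum of the pack costs, after checking the k-IN-p-COSCHEDULE constraints."""
    seen = set()
    for pack in packs:
        if not pack or len(pack) > k:
            raise ValueError("each pack needs between 1 and k tasks")
        if any(procs < 1 for procs in pack.values()) or sum(pack.values()) > p:
            raise ValueError("invalid processor assignment")
        if not seen.isdisjoint(pack):
            raise ValueError("a task appears in two packs")
        seen.update(pack)
    if seen != set(range(len(t))):
        raise ValueError("every task must be in exactly one pack")
    return sum(pack_cost(pack, t) for pack in packs)


# Hypothetical profiles for three tasks on p = 4 processors.
t = [[8, 5, 4, 4], [6, 4, 3, 3], [4, 3, 3, 3]]
packs = [{0: 2, 1: 2}, {2: 4}]          # two packs: {T0, T1} and {T2}
total = coschedule_cost(packs, t, p=4, k=2)
sequential = sum(row[-1] for row in t)  # each task alone on all p processors
```

On this toy instance the co-schedule is shorter than running each task alone on all processors, which is the effect the paper aims to exploit.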



4 Theoretical results 



First we discuss the complexity of the problem in Section 4.1, by exhibiting polynomial and NP-complete instances.
Next we discuss how to optimally schedule a set of $k$ tasks in a single pack (Section 4.2). Then we explain how to
compute the optimal solution (in expected exponential cost) in Section 4.3. Finally, we provide an approximation
algorithm in Section 4.4.



4.1 Complexity 

Theorem 1. The 1-IN-$p$-COSCHEDULE and 2-IN-$p$-COSCHEDULE problems can both be solved in polynomial time.

Proof. This result is obvious for 1-IN-$p$-COSCHEDULE: each task is assigned exactly $p$ processors (see Equation (1))
and the minimum execution time is $\sum_{i=1}^{n} t_{i,p}$.

This proof is more involved for 2-IN-$p$-COSCHEDULE, and we start with the 2-IN-2-COSCHEDULE problem to get
an intuition. Consider the weighted undirected graph $G = (V, E)$, where $|V| = n$, each vertex $v_i \in V$ corresponding
to a task $T_i$. The edge set $E$ is the following: (i) for all $i$, there is a loop on $v_i$ of weight $t_{i,2}$; (ii) for all $i < i'$, there is
an edge between $v_i$ and $v_{i'}$ of weight $\max(t_{i,1}, t_{i',1})$. Finding a perfect matching of minimal weight in $G$ leads to the
optimal solution to 2-IN-2-COSCHEDULE, which can thus be solved in polynomial time.

For the 2-IN-$p$-COSCHEDULE problem, the proof is similar; the only difference lies in the construction of the
edge set $E$: (i) for all $i$, there is a loop on $v_i$ of weight $t_{i,p}$; (ii) for all $i < i'$, there is an edge between $v_i$ and $v_{i'}$ of
weight $\min_{j=1..p-1} \left( \max(t_{i,p-j}, t_{i',j}) \right)$, since each of the two tasks needs at least one processor. Again, a perfect
matching of minimal weight in $G$ gives the optimal solution to 2-IN-$p$-COSCHEDULE. We conclude that the
2-IN-$p$-COSCHEDULE problem can be solved in polynomial time. □
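
The edge weights used in this construction are easy to compute, as the following Python sketch shows. Note that `best_2_in_p` is not the polynomial matching algorithm of the proof: it is an exhaustive search over pairings (exponential in $n$), included only as a hypothetical way to cross-check small instances. The layout `t[i][j - 1]` and the sample profiles are assumptions, and `pair_weight` requires $p \ge 2$.

```python
from functools import lru_cache


def pair_weight(ti, tj, p):
    """Weight of the edge between two tasks: best split of p processors.

    ti[q - 1] (resp. tj[q - 1]) is the execution time on q processors.
    """
    return min(max(ti[p - j - 1], tj[j - 1]) for j in range(1, p))


def best_2_in_p(t, p):
    """Optimal 2-IN-p-COSCHEDULE cost by exhaustive search over pairings."""
    n = len(t)

    @lru_cache(maxsize=None)
    def solve(remaining):
        if remaining == 0:
            return 0
        i = (remaining & -remaining).bit_length() - 1  # lowest unscheduled task
        rest = remaining & ~(1 << i)
        best = t[i][p - 1] + solve(rest)               # loop on v_i: task alone
        for j in range(i + 1, n):
            if rest >> j & 1:                          # pair task i with task j
                best = min(best, pair_weight(t[i], t[j], p) + solve(rest & ~(1 << j)))
        return best

    return solve((1 << n) - 1)


# Hypothetical profiles for three tasks on p = 4 processors.
t = [[8, 5, 4, 4], [6, 4, 3, 3], [4, 3, 3, 3]]
```

For larger instances one would replace the search by a minimum-weight perfect matching routine, as in the proof.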

Theorem 2. When $k \ge 3$, the $k$-IN-$p$-COSCHEDULE problem is strongly NP-complete.

Proof. We prove the NP-completeness of the decision problem associated to $k$-IN-$p$-COSCHEDULE: given $n$
independent tasks, $p$ processors, a set of execution times $t_{i,j}$ for $1 \le i \le n$ and $1 \le j \le p$ satisfying Equations (1) and (2), a
fixed constant $k \le p$ and a deadline $D$, can we find a co-schedule with at most $k$ tasks per pack, and whose execution
time does not exceed $D$? The problem is obviously in NP: if we have the composition of every pack, and for each task
in a pack, the number of processors onto which it is assigned, we can verify in polynomial time: (i) that it is indeed a
pack schedule; (ii) that the execution time is smaller than a given deadline.



We first prove the strong completeness of 3-IN-$p$-COSCHEDULE. We use a reduction from 3-PARTITION. Consider
an arbitrary instance $\mathcal{I}_1$ of 3-PARTITION: given an integer $B$ and $3n$ integers $a_1, \dots, a_{3n}$, can we partition the $3n$
integers into $n$ triplets, each of sum $B$? We can assume that $\sum_{i=1}^{3n} a_i = nB$, otherwise $\mathcal{I}_1$ has no solution. The
3-PARTITION problem is NP-hard in the strong sense [10], which implies that we can encode all integers ($a_1, \dots, a_{3n}$,
$B$) in unary. We build the following instance $\mathcal{I}_2$ of 3-IN-$p$-COSCHEDULE: the number of processors is $p = B$, the
deadline is $D = n$, there are $3n$ tasks $T_i$, with the following execution times: for all $i, j$, if $j < a_i$ then $t_{i,j} = 1 + \frac{1}{a_i}$,
otherwise $t_{i,j} = 1$. It is easy to check that Equations (1) and (2) are both satisfied. For the latter, since there are only
two possible execution times for each task, we only need to check Equation (2) for $j = a_i - 1$, and we do obtain that
$(a_i - 1)\left(1 + \frac{1}{a_i}\right) \le a_i$. Finally, $\mathcal{I}_2$ has a size polynomial in the size of $\mathcal{I}_1$, even if we write all instance parameters in
unary: the execution time is $n$, and the $t_{i,j}$ have the same size as the $a_i$.
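
The construction of $\mathcal{I}_2$ can be sketched in Python as follows, using exact rational arithmetic so that the check of Equations (1) and (2) is not affected by rounding. The function names and the sample 3-PARTITION instance are hypothetical; `t[i][j - 1]` stands for $t_{i,j}$.

```python
from fractions import Fraction


def build_instance(a, B):
    """Build the 3-IN-p-COSCHEDULE instance of the reduction.

    a holds the 3n integers of the 3-PARTITION instance and B the target sum.
    Returns (p, deadline, t) with t[i][j - 1] the time of task i on j processors.
    """
    assert len(a) % 3 == 0 and sum(a) == (len(a) // 3) * B
    p, deadline = B, len(a) // 3
    t = [[1 + Fraction(1, ai) if j < ai else Fraction(1) for j in range(1, p + 1)]
         for ai in a]
    return p, deadline, t


def check_profiles(t):
    """Verify Equations (1) and (2) on every task of the instance."""
    for row in t:
        work = [j * tj for j, tj in enumerate(row, start=1)]
        if any(x < y for x, y in zip(row, row[1:])):
            return False                      # execution time increases somewhere
        if any(x > y for x, y in zip(work, work[1:])):
            return False                      # work decreases somewhere
    return True


# Hypothetical instance: triplets (2, 2, 2) and (1, 2, 3) both sum to B = 6.
p, deadline, t = build_instance([2, 2, 2, 1, 2, 3], B=6)
```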

We now prove that $\mathcal{I}_1$ has a solution if and only if $\mathcal{I}_2$ does. Assume first that $\mathcal{I}_1$ has a solution. For each triplet
$(a_i, a_j, a_k)$ of $\mathcal{I}_1$, we create a pack with the three tasks $(T_i, T_j, T_k)$ where $T_i$ is scheduled on $a_i$ processors, $T_j$ on $a_j$
processors, and $T_k$ on $a_k$ processors. By definition, we have $a_i + a_j + a_k = B$, and the execution time of this pack is
1. We do this for the $n$ triplets, which gives a valid co-schedule whose total execution time is $n$. Hence $\mathcal{I}_2$ has a
solution.

Assume now that $\mathcal{I}_2$ has a solution. The minimum execution time for any pack is 1 (since it is the minimum
execution time of any task and a pack cannot be empty). Hence the solution cannot have more than $n$ packs. Because
there are $3n$ tasks and the number of elements in a pack is limited to three, there are exactly $n$ packs, each of exactly
3 elements, and furthermore all these packs have an execution time of 1 (otherwise the deadline $n$ is not matched). If
there were a pack $(T_i, T_j, T_k)$ such that $a_i + a_j + a_k > B$, then one of the three tasks, say $T_i$, would have to use fewer
than $a_i$ processors, hence would have an execution time greater than 1. Therefore, for each pack $(T_i, T_j, T_k)$, we have
$a_i + a_j + a_k \le B$. The fact that this inequality is an equality for all packs follows from the fact that $\sum_{i=1}^{3n} a_i = nB$.
Finally, we conclude by saying that the set of triplets $(a_i, a_j, a_k)$ for every pack $(T_i, T_j, T_k)$ is a solution to $\mathcal{I}_1$.

The final step is to prove the completeness of $k$-IN-$p$-COSCHEDULE for a given $k \ge 4$. We perform a similar
reduction from the same instance $\mathcal{I}_1$ of 3-PARTITION. We construct the instance $\mathcal{I}_2$ of $k$-IN-$p$-COSCHEDULE where
the number of processors is $p = B + (k - 3)(B + 1)$ and the deadline is $D = n$. There are $3n$ tasks $T_i$ with the same
execution times as before (for $1 \le i \le 3n$, if $j < a_i$ then $t_{i,j} = 1 + \frac{1}{a_i}$, otherwise $t_{i,j} = 1$), and also $n(k - 3)$ new
identical tasks such that, for $3n + 1 \le i \le kn$, $t_{i,j} = \max\left(\frac{B+1}{j}, 1\right)$. It is easy to check that Equations (1) and (2) are
also fulfilled for the new tasks. If $\mathcal{I}_1$ has a solution, we construct the solution to $\mathcal{I}_2$ similarly to the previous reduction,
and we add to each pack $k - 3$ tasks $T_i$ with $3n + 1 \le i \le kn$, each assigned to $B + 1$ processors. This solution
has an execution time exactly equal to $n$. Conversely, if $\mathcal{I}_2$ has a solution, we can verify that there are exactly $n$ packs
(there are $kn$ tasks and each pack has an execution time at least equal to 1). Then we can verify that there are at most
$(k - 3)$ tasks $T_i$ with $3n + 1 \le i \le kn$ per pack, since there are exactly $(k - 3)(B + 1) + B$ processors. Otherwise, if
there were $k - 2$ (or more) such tasks in a pack, then one of them would be scheduled on fewer than $B + 1$ processors,
and the execution time of the pack would be greater than 1. Finally, we can see that in $\mathcal{I}_2$, each pack is composed of
$(k - 3)$ tasks $T_i$ with $3n + 1 \le i \le kn$, scheduled on $(k - 3)(B + 1)$ processors at least, and that there remain triplets
of tasks $T_i$, with $1 \le i \le 3n$, scheduled on at most $B$ processors. The end of the proof is identical to the reduction in
the case $k = 3$. □

Note that the 3-IN-$p$-COSCHEDULE problem is NP-complete, and the 2-IN-$p$-COSCHEDULE problem can be
solved in polynomial time, hence 3-IN-3-COSCHEDULE is the simplest problem whose complexity remains open.

4.2 Scheduling a pack of tasks 

In this section, we discuss how to optimally schedule a set of $k$ tasks in a single pack: the $k$ tasks $T_1, \dots, T_k$ are given,
and we search for an assignment function $\sigma : \{1, \dots, k\} \to \{1, \dots, p\}$ such that $\sum_{i=1}^{k} \sigma(i) \le p$, where $\sigma(i)$ is the
number of processors assigned to task $T_i$. Such a schedule is called a 1-pack-schedule, and its cost is $\max_{1 \le i \le k} t_{i,\sigma(i)}$.
In Algorithm 1 below, we use the notation $T_i \preccurlyeq_\sigma T_j$ if $t_{i,\sigma(i)} \le t_{j,\sigma(j)}$.

Theorem 3. Given $k$ tasks to be scheduled on $p$ processors in a single pack, Algorithm 1 finds a 1-pack-schedule of
minimum cost in time $O(p \log(k))$.



Algorithm 1: Finding the optimal 1-pack-schedule $\sigma$ of $k$ tasks in the same pack.
procedure Optimal-1-pack-schedule($T_1, \dots, T_k$)
begin
    for $i = 1$ to $k$ do
        $\sigma(i) \leftarrow 1$
    end
    Let $L$ be the list of tasks sorted in non-increasing values of $\preccurlyeq_\sigma$;
    $p_{available} := p - k$;
    while $p_{available} \ne 0$ do
        $T_{i^*}$ := head($L$);
        $L$ := tail($L$);
        $\sigma(i^*) \leftarrow \sigma(i^*) + 1$;
        $p_{available} := p_{available} - 1$;
        $L$ := Insert $T_{i^*}$ in $L$ according to its $\preccurlyeq_\sigma$ value;
    end
    return $\sigma$;
end
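
A minimal Python sketch of Algorithm 1 follows, using a binary heap in place of the sorted list $L$. The function name, the layout `t[i][j - 1]` for $t_{i,j}$, and the sample profiles are assumptions of this sketch rather than the paper's implementation.

```python
import heapq


def optimal_one_pack(t, p):
    """Greedy processor assignment for k tasks sharing one pack.

    t[i][j - 1] is the execution time of task i on j processors; requires
    len(t) <= p. Returns (sigma, cost), where sigma[i] is the number of
    processors given to task i.
    """
    k = len(t)
    if k > p:
        raise ValueError("a pack cannot hold more tasks than processors")
    sigma = [1] * k
    # Max-heap on the current execution time (heapq is a min-heap, hence the negation).
    heap = [(-t[i][0], i) for i in range(k)]
    heapq.heapify(heap)
    for _ in range(p - k):
        _, i = heapq.heappop(heap)      # task with the longest execution time
        sigma[i] += 1                   # give it one extra processor
        heapq.heappush(heap, (-t[i][sigma[i] - 1], i))
    cost = max(t[i][sigma[i] - 1] for i in range(k))
    return sigma, cost


# Hypothetical profiles for two tasks on p = 4 processors.
t = [[8, 5, 4, 4], [6, 4, 3, 3]]
sigma, cost = optimal_one_pack(t, p=4)
```

Ties between tasks of equal execution time are broken by task index here; any tie-breaking rule would do.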



In this greedy algorithm, we first assign one processor to each task, and while there are processors that are not 
processing any task, we select the task with the longest execution time and assign an extra processor to this task. 
Algorithm 1 performs $p - k$ iterations to assign the extra processors. We denote by $\sigma^{(\ell)}$ the current value of the
function $\sigma$ at the end of iteration $\ell$. For convenience, we let $t_{i,0} = +\infty$ for $1 \le i \le k$. We start with the following
lemma: 

Lemma: At the end of iteration $\ell$ of Algorithm 1, let $T_{i^*}$ be the first task of the sorted list, i.e., the task with longest
execution time. Then, for all $i$, $t_{i^*,\sigma^{(\ell)}(i^*)} \le t_{i,\sigma^{(\ell)}(i)-1}$.

Proof. Let $T_{i^*}$ be the task with longest execution time at the end of iteration $\ell$. For tasks such that $\sigma^{(\ell)}(i) = 1$,
the result is obvious since $t_{i,0} = +\infty$. Let us consider any task $T_i$ such that $\sigma^{(\ell)}(i) > 1$. Let $\ell' + 1$ be the last
iteration when a new processor was assigned to task $T_i$: $\sigma^{(\ell')}(i) = \sigma^{(\ell)}(i) - 1$ and $\ell' < \ell$. By definition of iteration
$\ell' + 1$, task $T_i$ was chosen because $t_{i,\sigma^{(\ell')}(i)}$ was greater than any other task, in particular $t_{i,\sigma^{(\ell')}(i)} \ge t_{i^*,\sigma^{(\ell')}(i^*)}$.
Also, since we never remove processors from tasks, we have $\sigma^{(\ell')}(i) \le \sigma^{(\ell)}(i)$ and $\sigma^{(\ell')}(i^*) \le \sigma^{(\ell)}(i^*)$. Finally,
by Equation (1), $t_{i^*,\sigma^{(\ell)}(i^*)} \le t_{i^*,\sigma^{(\ell')}(i^*)} \le t_{i,\sigma^{(\ell')}(i)} = t_{i,\sigma^{(\ell)}(i)-1}$. □

We are now ready to prove Theorem 3.

Proof of Theorem 3. Let $\sigma$ be the 1-pack-schedule returned by Algorithm 1, of cost $c(\sigma)$, and let $T_{i^*}$ be a task such that
$c(\sigma) = t_{i^*,\sigma(i^*)}$. Let $\sigma'$ be a 1-pack-schedule of cost $c(\sigma')$. We prove below that $c(\sigma') \ge c(\sigma)$, hence $\sigma$ is a
1-pack-schedule of minimum cost:

• If $\sigma'(i^*) \le \sigma(i^*)$, then $T_{i^*}$ has no more processors in $\sigma'$ than in $\sigma$, hence its execution time is not smaller, and
$c(\sigma') \ge c(\sigma)$.

• If $\sigma'(i^*) > \sigma(i^*)$, then there exists $i$ such that $\sigma'(i) < \sigma(i)$ (since the total number of processors is $p$ in both
$\sigma$ and $\sigma'$). We can apply the previous Lemma at the end of the last iteration, where $T_{i^*}$ is the task of maximum
execution time: $t_{i^*,\sigma(i^*)} \le t_{i,\sigma(i)-1} \le t_{i,\sigma'(i)}$, and therefore $c(\sigma') \ge c(\sigma)$.

Finally, the time complexity is obtained as follows: first we sort $k$ elements, in time $O(k \log k)$. Then there are $p - k$
iterations, and at each iteration, we insert an element in a sorted list of $k - 1$ elements, which takes $O(\log k)$ operations
(use a heap for the data structure of $L$). □

Note that it is easy to compute an optimal 1-pack-schedule using a dynamic-programming algorithm: the optimal
cost is $c(k, p)$, which we compute using the recurrence formula

$c(i, q) = \min_{1 \le q' \le q} \left\{ \max\left( c(i - 1, q - q'), t_{i,q'} \right) \right\}$

for $2 \le i \le k$ and $1 \le q \le p$, initialized by $c(1, q) = t_{1,q}$, and $c(i, 0) = +\infty$. The complexity of this algorithm is
$O(kp^2)$. However, we can significantly reduce the complexity of this algorithm by using Algorithm 1.
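
For reference, the recurrence can be transcribed directly into Python as below. This is an illustrative sketch: the function name and the layout `t[i][j - 1]` (0-based task index $i$, time on $j$ processors) are assumptions, and only the cost $c(k, p)$ is returned, not the assignment.

```python
import math


def one_pack_dp(t, p):
    """Optimal 1-pack-schedule cost c(k, p) via the recurrence, in O(k p^2) time."""
    k = len(t)
    # c[q] holds c(i, q) for the current task count i; c(i, 0) = +infinity.
    c = [math.inf] + [t[0][q - 1] for q in range(1, p + 1)]
    for i in range(1, k):
        c = [math.inf] + [
            min(max(c[q - q2], t[i][q2 - 1]) for q2 in range(1, q + 1))
            for q in range(1, p + 1)
        ]
    return c[p]


# Hypothetical profiles for two tasks on p = 4 processors.
t = [[8, 5, 4, 4], [6, 4, 3, 3]]
cost = one_pack_dp(t, p=4)
```

Such a table-based computation is mainly useful as an independent check of a greedy implementation on small inputs.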

4.3 Computing the optimal solution 

In this section we sketch two methods to find the optimal solution to the general $k$-IN-$p$-COSCHEDULE problem. This
can be useful to solve some small-size instances, albeit at the price of a cost exponential in the number of tasks $n$.

The first method is to generate all possible partitions of the tasks into packs. This amounts to computing all
partitions of $n$ elements into subsets of cardinal at most $k$. For a given partition of tasks into packs, we use Algorithm 1
to find the optimal processor assignment for each pack, and we can compute the optimal cost for the partition. There 
remains to take the minimum of these costs among all partitions. 
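
The first method can be sketched in Python as follows. The helper names, the data layout (`t[i][j - 1]` for $t_{i,j}$) and the sample profiles are hypothetical; `pack_cost` is a compact version of the greedy assignment of Algorithm 1, and the enumeration is exponential in $n$, so the sketch is only meant for very small instances.

```python
import heapq


def pack_cost(tasks, t, p):
    """Optimal cost of one pack (greedy assignment); tasks is a tuple of indices."""
    sigma = {i: 1 for i in tasks}
    heap = [(-t[i][0], i) for i in tasks]
    heapq.heapify(heap)
    for _ in range(p - len(tasks)):
        _, i = heapq.heappop(heap)
        sigma[i] += 1
        heapq.heappush(heap, (-t[i][sigma[i] - 1], i))
    return max(t[i][sigma[i] - 1] for i in tasks)


def partitions(items, k):
    """Yield every partition of items into blocks of at most k elements."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for sub in partitions(rest, k):
        yield [(first,)] + sub                     # first alone in a new block
        for idx, block in enumerate(sub):          # or first joins an existing block
            if len(block) < k:
                yield sub[:idx] + [(first,) + block] + sub[idx + 1:]


def optimal_coschedule(t, p, k):
    """Exhaustive search over all partitions into packs of at most k tasks."""
    k = min(k, p)                                  # a pack cannot exceed p tasks
    def cost(part):
        return sum(pack_cost(block, t, p) for block in part)
    best = min(partitions(list(range(len(t))), k), key=cost)
    return cost(best), best


# Hypothetical profiles for three tasks on p = 4 processors.
t = [[8, 5, 4, 4], [6, 4, 3, 3], [4, 3, 3, 3]]
```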

The second method is to cast the problem in terms of an integer linear program:

Theorem 4. The following integer linear program characterizes the $k$-IN-$p$-COSCHEDULE problem, where the
unknown variables are the $x_{i,j,b}$'s (Boolean variables) and the $y_b$'s (rational variables), for $1 \le i, b \le n$ and $1 \le j \le p$:

Minimize $\sum_{b=1}^{n} y_b$ subject to

(i) $\sum_{j,b} x_{i,j,b} = 1$, $1 \le i \le n$

(ii) $\sum_{i,j} x_{i,j,b} \le k$, $1 \le b \le n$

(iii) $\sum_{i,j} j \times x_{i,j,b} \le p$, $1 \le b \le n$

(iv) $x_{i,j,b} \times t_{i,j} \le y_b$, $1 \le i, b \le n$, $1 \le j \le p$ (3)

Proof. The $x_{i,j,b}$'s are such that $x_{i,j,b} = 1$ if and only if task $T_i$ is in the pack $b$ and it is executed on $j$ processors; $y_b$
is the execution time of pack $b$. Since there are no more than $n$ packs (one task per pack), $b \le n$. The sum $\sum_{b=1}^{n} y_b$
is therefore the total execution time ($y_b = 0$ if there are no tasks in pack $b$). Constraint (i) states that each task is
assigned to exactly one pack $b$, and with one number of processors $j$. Constraint (ii) ensures that there are not more
than $k$ tasks in a pack. Constraint (iii) adds up the number of processors in pack $b$, which should not exceed $p$. Finally,
constraint (iv) computes the cost of each pack. □

4.4 Approximation algorithm 

In this section we introduce PACK-APPROX, a 3-approximation algorithm for the $p$-IN-$p$-COSCHEDULE problem. The
design principle of PACK-APPROX is the following: we start from the assignment where each task is executed on one
processor, and use Algorithm 2 to build a first solution. Algorithm 2 is a greedy heuristic that builds a co-schedule
when each task is pre-assigned a number of processors for execution. Then we iteratively refine the solution, adding a
processor to the task with longest execution time, and re-executing Algorithm 2. Here are details on both algorithms:

Algorithm 2. The $k$-IN-$p$-COSCHEDULE problem with processor pre-assignments remains strongly NP-complete
(use a similar reduction as in the proof of Theorem 2). We propose a greedy procedure in Algorithm 2, which is similar
to the First Fit Decreasing Height algorithm for strip packing [6]. The output is a co-schedule with at most $k$ tasks per
pack, and the complexity is $O(n \log(n))$ (dominated by sorting).

Algorithm 3. We iterate the calls to Algorithm 2, adding a processor to the task with longest execution time,
until: (i) either the task of longest execution time is already assigned $p$ processors, or (ii) the sum of the work of
all tasks is greater than $p$ times the longest execution time. The algorithm returns the minimum cost found during
execution. The complexity of this algorithm is $O(n^2 p)$ (in the calls to Algorithm 2 we do not need to re-sort the list
but maintain it sorted instead) in the simplest version presented here, but can be reduced to $O(n \log(n) + np)$ using
standard algorithmic techniques.

Theorem 5. PACK-APPROX is a 3-approximation algorithm for the $p$-IN-$p$-COSCHEDULE problem.

Proof. We start with some notations:

• step $i$ denotes the $i$-th iteration of the main loop of Algorithm PACK-APPROX;

• $\sigma^{(i)}$ is the allocation function at step $i$;

• $t_{\max}(i) = \max_j t_{j,\sigma^{(i)}(j)}$ is the maximum execution time of any task at step $i$;



Algorithm 2: Creating packs of size at most $k$, when the number $\sigma(i)$ of processors per task $T_i$ is fixed.
procedure MAKE-PACK($n$, $p$, $k$, $\sigma$)
begin
    Let $L$ be the list of tasks sorted in non-increasing values of execution times $t_{i,\sigma(i)}$;
    while $L \ne \emptyset$ do
        Schedule the current task on the first pack with enough available processors and fewer than $k$ tasks.
        Create a new pack if no existing pack fits;
        Remove the current task from $L$;
    end
    return the set of packs
end



Algorithm 3: PACK-APPROX
procedure PACK-APPROX($T_1, \dots, T_n$)
begin
    $COST = +\infty$;
    for $j = 1$ to $n$ do $\sigma(j) \leftarrow 1$;
    for $i = 0$ to $n(p - 1) - 1$ do
        Let $A_{tot}(i) = \sum_{j=1}^{n} t_{j,\sigma(j)} \sigma(j)$;
        Let $T_{j^*}$ be one task that maximizes $t_{j,\sigma(j)}$;
        Call MAKE-PACK($n$, $p$, $p$, $\sigma$);
        Let $COST_i$ be the cost of the co-schedule;
        if $COST_i < COST$ then $COST \leftarrow COST_i$;
        if ($\frac{A_{tot}(i)}{p} > t_{j^*,\sigma(j^*)}$) or ($\sigma(j^*) = p$) then return $COST$; /* Exit loop */
        else $\sigma(j^*) \leftarrow \sigma(j^*) + 1$; /* Add a processor to $T_{j^*}$ */
    end
    return $COST$;
end



• j*{i) is the index of the task with longest execution time at step i (break ties arbitrarily); 

• Aot(*) = X^i ^j o-'^Hj)'^^*'' (j) i^ '■h^ '■°'-^l work that has to be done at step i; 

• COSTi is the result of the scheduling procedure at the end of step i; 

• OPT denotes an optimal solution, with allocation function a'^"'^\ execution time COSTqpt, and total work 

j 

Note that there are three different ways to exit algorithm PACK-APPROX:

1. If we cannot add processors to the task with longest execution time, i.e., $\sigma^{(i)}(j^*(i)) = p$;

2. If $\frac{A_{tot}(i)}{p} > t_{\max}(i)$ after having computed the execution time for this assignment;

3. When each task has been assigned $p$ processors (the last step of the loop "for": we have assigned exactly $np$
processors, and no task can be assigned more than $p$ processors).
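
Algorithms 2 and 3 can be sketched together in Python as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: `t[j][q - 1]` stands for $t_{j,q}$, the list is re-sorted at every call to `make_pack` instead of being maintained sorted, and the sample profiles are hypothetical. The sketch assumes $p \ge 2$ (for $p = 1$ the main loop is empty, as in the pseudocode).

```python
import math


def make_pack(t, p, k, sigma):
    """First-fit over tasks sorted by non-increasing execution time.

    sigma[j] is the fixed number of processors of task j.
    Returns (cost, packs), each pack being a list of task indices.
    """
    order = sorted(range(len(t)), key=lambda j: -t[j][sigma[j] - 1])
    packs, free = [], []
    for j in order:
        for b, pack in enumerate(packs):
            if free[b] >= sigma[j] and len(pack) < k:
                pack.append(j)
                free[b] -= sigma[j]
                break
        else:                                   # no existing pack fits: open a new one
            packs.append([j])
            free.append(p - sigma[j])
    # The first task placed in a pack is its longest one, so it gives the pack cost.
    cost = sum(t[pack[0]][sigma[pack[0]] - 1] for pack in packs)
    return cost, packs


def pack_approx(t, p):
    """Iteratively add a processor to the longest task and keep the best co-schedule."""
    n = len(t)
    sigma = [1] * n
    best = math.inf
    for _ in range(n * (p - 1)):
        a_tot = sum(t[j][sigma[j] - 1] * sigma[j] for j in range(n))
        j_star = max(range(n), key=lambda j: t[j][sigma[j] - 1])
        cost, _ = make_pack(t, p, p, sigma)
        best = min(best, cost)
        if a_tot / p > t[j_star][sigma[j_star] - 1] or sigma[j_star] == p:
            break                               # exit conditions 2 and 1 of the list above
        sigma[j_star] += 1                      # add a processor to the longest task
    return best


# Hypothetical profiles for three tasks on p = 4 processors.
t = [[8, 5, 4, 4], [6, 4, 3, 3], [4, 3, 3, 3]]
```

On this toy instance the loop stops through the second exit condition after three steps.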



Lemma 1. At the end of step $i$, $COST_i \le 3 \max\left(t_{\max}(i), \frac{A_{tot}(i)}{p}\right)$.

Proof. Consider the packs returned by Algorithm 2, sorted by non-increasing execution times, $B_1, B_2, \ldots, B_n$ (some of the packs may be empty, with an execution time 0). Let us denote, for $1 \le q \le n$,

• $j_q$ the task with the longest execution time of pack $B_q$ (i.e., the first task scheduled on $B_q$);

• $t_q$ the execution time of pack $B_q$ (in particular, $t_q = t_{j_q,\sigma^{(i)}(j_q)}$);

• $A_q$ the sum of the task works in pack $B_q$;

• $p_q$ the number of processors available in pack $B_q$ when $j_{q+1}$ was scheduled in pack $B_{q+1}$.

With these notations, $COST_i = \sum_{q=1}^{n} t_q$ and $A_{tot}(i) = \sum_{q=1}^{n} A_q$. For each pack, note that $p\,t_q \ge A_q$, since $p\,t_q$ is the maximum work that can be done on $p$ processors with an execution time of $t_q$. Hence, $COST_i \ge \frac{A_{tot}(i)}{p}$.

In order to bound $COST_i$, let us first remark that $\sigma^{(i)}(j_{q+1}) > p_q$: otherwise $j_{q+1}$ would have been scheduled on pack $B_q$. Then, we can exhibit a lower bound for $A_q$, namely $A_q \ge t_{q+1}(p - p_q)$. Indeed, the tasks scheduled before $j_{q+1}$ all have a length greater than $t_{q+1}$ by definition. Furthermore, obviously $A_{q+1} \ge t_{q+1}\,p_q$ (the work of the first task scheduled in pack $B_{q+1}$). So finally we have $A_q + A_{q+1} \ge t_{q+1}\,p$.

Summing over all $q$'s, we have: $2 \sum_{q=1}^{n} \frac{A_q}{p} \ge \sum_{q=2}^{n} t_q$, hence $2\,\frac{A_{tot}(i)}{p} + t_1 \ge COST_i$. Finally, note that $t_1 = t_{\max}(i)$, and therefore $COST_i \le 3 \max\left(t_{\max}(i), \frac{A_{tot}(i)}{p}\right)$. Note that this proof is similar to the one for the Strip-Packing problem in [6]. □
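The summation step of this proof can be spelled out as follows. This is a worked restatement that uses only the inequalities established in the proof, together with the fact that every $A_q$ is non-negative:

```latex
\begin{align*}
p \sum_{q=2}^{n} t_q
  &\le \sum_{q=1}^{n-1} \left( A_q + A_{q+1} \right)
     && \text{(sum } A_q + A_{q+1} \ge t_{q+1}\,p \text{ over } q = 1, \dots, n-1\text{)} \\
  &\le 2 \sum_{q=1}^{n} A_q = 2\,A_{tot}(i)
     && \text{(each } A_q \text{ appears at most twice)} \\
COST_i = t_1 + \sum_{q=2}^{n} t_q
  &\le t_{\max}(i) + 2\,\frac{A_{tot}(i)}{p}
   \le 3 \max\left( t_{\max}(i), \frac{A_{tot}(i)}{p} \right).
\end{align*}
```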

Lemma 2. At each step $i$, $A_{tot}(i+1) \ge A_{tot}(i)$ and $t_{\max}(i+1) \le t_{\max}(i)$, i.e., the total work is increasing and the maximum execution time is decreasing.

Proof. $A_{tot}(i+1) = A_{tot}(i) - a + b$, where

• $a = work(j^*(i), \sigma^{(i)}(j^*(i)))$, and

• $b = work(j^*(i), \sigma^{(i+1)}(j^*(i)))$.

But $b = work(j^*(i), \sigma^{(i)}(j^*(i)) + 1)$ and $a \le b$ by Equation (2). Therefore, $A_{tot}(i+1) \ge A_{tot}(i)$. Finally, $t_{\max}(i+1) \le t_{\max}(i)$ since only one of the tasks with the longest execution time is modified, and its execution time can only decrease thanks to Equation (1). □

Lemma 3. Given an optimal solution OPT, $\forall j$, $t_{j,\sigma^{(OPT)}(j)} \le COST_{OPT}$ and $A_{OPT} \le p\,COST_{OPT}$.

Proof. The first inequality is obvious. As for the second one, $p\,COST_{OPT}$ is the maximum work that can be done on $p$ processors within an execution time of $COST_{OPT}$, hence it must not be smaller than $A_{OPT}$, which is the sum of the work of the tasks with the optimal allocation. □

Lemma 4. For any step $i$ such that $t_{\max}(i) > COST_{OPT}$, then $\forall j$, $\sigma^{(i)}(j) \le \sigma^{(OPT)}(j)$, and $A_{tot}(i) \le A_{OPT}$.

Proof. Consider a task $T_j$. If $\sigma^{(i)}(j) = 1$, then clearly $\sigma^{(i)}(j) \le \sigma^{(OPT)}(j)$. Otherwise, $\sigma^{(i)}(j) > 1$, and then by definition of the algorithm, there was a step $i' < i$ such that $\sigma^{(i')}(j) = \sigma^{(i)}(j) - 1$ and $\sigma^{(i'+1)}(j) = \sigma^{(i)}(j)$. Therefore $t_{\max}(i') = t_{j,\sigma^{(i')}(j)}$. Following Lemma 2, we have $t_{\max}(i') \ge t_{\max}(i) > COST_{OPT}$. Then necessarily, $\sigma^{(OPT)}(j) > \sigma^{(i')}(j)$, hence the result. Finally, $A_{tot}(i) \le A_{OPT}$ is a simple corollary of the previous result and of Equation (2). □

Lemma 5. For any step $i$ such that $t_{\max}(i) > COST_{OPT}$, then $\frac{A_{tot}(i)}{p} < t_{\max}(i)$.

Proof. Thanks to Lemma 4, we have $\frac{A_{tot}(i)}{p} \le \frac{A_{OPT}}{p}$. Lemma 3 gives us $\frac{A_{OPT}}{p} \le COST_{OPT}$, hence the result. □

Lemma 6. There exists $i_0 \ge 0$ such that $t_{\max}(i_0 - 1) > COST_{OPT} \ge t_{\max}(i_0)$ (we let $t_{\max}(-1) = +\infty$).

Proof. We show this result by contradiction. Suppose such $i_0$ does not exist. Then $t_{\max}(0) > COST_{OPT}$ (otherwise $i_0 = 0$ would suffice). Let us call $i_1$ the last step of the run of the algorithm. Then by induction we have the following property, $t_{\max}(0) \ge t_{\max}(1) \ge \cdots \ge t_{\max}(i_1) > COST_{OPT}$ (otherwise $i_0$ would exist, hence contradicting our hypothesis). Recall that there are three ways to exit the algorithm, hence three possible definitions for $i_1$:

• $\sigma^{(i_1)}(j^*(i_1)) = p$, however then we would have $t_{\max}(i_1) = t_{j^*(i_1),p} > COST_{OPT} \ge t_{j^*(i_1),\sigma^{(OPT)}(j^*(i_1))}$ (according to Lemma 3). This contradicts Equation (1), which states that $t_{j^*(i_1),p} \le t_{j^*(i_1),k}$ for all $k$.

• $i_1 = n(p-1) - 1$, but then we have the same result, i.e., $\sigma^{(i_1)}(j^*(i_1)) = p$ because this is true for all tasks.

• $t_{\max}(i_1) < \frac{A_{tot}(i_1)}{p}$, but this is false according to Lemma 5.

We have seen that PACK-APPROX could not have terminated at step $i_1$, however since PACK-APPROX terminates (in at most $n(p-1) - 1$ steps), we have a contradiction. Hence we have shown the existence of $i_0$. □

Lemma 7. $A_{tot}(i_0) \le A_{OPT}$.

Proof. Consider step $i_0$. If $i_0 = 0$, then at this step, all tasks are scheduled on exactly one processor, and $\forall j$, $\sigma^{(i_0)}(j) \le \sigma^{(OPT)}(j)$. Therefore, $A_{tot}(i_0) \le A_{OPT}$. If $i_0 \ne 0$, consider step $i_0 - 1$: $t_{\max}(i_0 - 1) > COST_{OPT}$. From Lemma 4, we have $\forall j$, $\sigma^{(i_0-1)}(j) \le \sigma^{(OPT)}(j)$. Furthermore, it is easy to see that $\forall j \ne j^*(i_0 - 1)$, $\sigma^{(i_0)}(j) = \sigma^{(i_0-1)}(j)$ since no task other than $j^*(i_0 - 1)$ is modified. We also have the following properties:

• $t_{j^*(i_0-1),\sigma^{(i_0-1)}(j^*(i_0-1))} = t_{\max}(i_0 - 1)$;

• $t_{\max}(i_0 - 1) > COST_{OPT}$ (by definition of step $i_0$);

• $COST_{OPT} \ge t_{j^*(i_0-1),\sigma^{(OPT)}(j^*(i_0-1))}$ (Lemma 3);

• $\sigma^{(i_0)}(j^*(i_0 - 1)) = \sigma^{(i_0-1)}(j^*(i_0 - 1)) + 1$.

The three first properties and Equation (1) allow us to say that $\sigma^{(i_0-1)}(j^*(i_0 - 1)) < \sigma^{(OPT)}(j^*(i_0 - 1))$. Thanks to the fourth property, $\sigma^{(i_0)}(j^*(i_0 - 1)) \le \sigma^{(OPT)}(j^*(i_0 - 1))$. Finally, we have, for all $j$, $\sigma^{(i_0)}(j) \le \sigma^{(OPT)}(j)$, and therefore $A_{tot}(i_0) \le A_{OPT}$ by Equation (2). □

We are now ready to prove the theorem. For $i_0$ introduced in Lemma 6, we have:

$$COST_{i_0} \le 3 \max\left(t_{\max}(i_0), \frac{A_{tot}(i_0)}{p}\right) \le 3 \max\left(COST_{OPT}, \frac{A_{OPT}}{p}\right) \le 3\,COST_{OPT}.$$

The first inequality comes from Lemma 1. The second inequality is due to Lemmas 6 and 7. The last inequality comes from Lemma 3, hence the final result. □

5 Heuristics 

In this section, we describe the heuristics that we use to solve the k-IN-p-COSCHEDULE problem.

Random-Pack. In this heuristic, we generate the packs randomly: as long as there remain tasks, randomly choose an integer $j$ between 1 and $k$, and then randomly select $j$ tasks to form a pack. Once the packs are generated, apply Algorithm 1 to optimally schedule each of them.
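The pack-generation step can be sketched as below. This is an illustrative sketch only: the function name and the `rng` parameter are hypothetical, taking all remaining tasks when fewer than $j$ are left is an assumption, and the per-pack scheduling by Algorithm 1 is not shown.

```python
import random

def random_packs(n_tasks, k, rng=random):
    """Partition task indices 0..n_tasks-1 into random packs of 1 to k tasks."""
    remaining = list(range(n_tasks))
    rng.shuffle(remaining)               # random order = random selection
    packs = []
    while remaining:
        # Draw a pack size j in [1, k], capped by the number of tasks left.
        size = min(rng.randint(1, k), len(remaining))
        packs.append(remaining[:size])
        remaining = remaining[size:]
    return packs
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing the one-run and nine-run variants.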

Random-Proc. In this heuristic, we assign the number of processors to each task randomly between 1 and $p$, then use Algorithm 2 to generate the packs, followed by Algorithm 1 on each pack.

A word of caution. We point out that Random-Pack and Random-Proc are not pure random heuristics, in that they already benefit from the theoretical results of Section 4. A more naive heuristic would pick both a task and a number of processors randomly, and greedily build packs, creating a new one as soon as more than $p$ resources are assigned within the current pack. Here, both Random-Pack and Random-Proc use the optimal resource allocation strategy (Algorithm 1) within a pack; in addition, Random-Proc uses an efficient partitioning algorithm (Algorithm 2) to create packs when resources are pre-assigned to tasks.
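For contrast, the more naive baseline described above could look like the following sketch. The function name and `rng` parameter are hypothetical; the description does not mention the pack-size limit $k$, so none is enforced here.

```python
import random

def naive_random_packs(n_tasks, p, rng=random):
    """Pick tasks and processor counts at random; close a pack on overflow."""
    order = list(range(n_tasks))
    rng.shuffle(order)
    packs, current, used = [], [], 0
    for task in order:
        procs = rng.randint(1, p)        # random processor count in [1, p]
        if used + procs > p:             # the current pack would exceed p
            packs.append(current)
            current, used = [], 0
        current.append((task, procs))
        used += procs
    if current:
        packs.append(current)
    return packs
```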

PACK-APPROX. This heuristic is an extension of Algorithm 3 in Section 4.4 to deal with packs of size $k$ rather than $p$: simply call MAKE-PACK$(n, p, k, \sigma)$ instead of MAKE-PACK$(n, p, p, \sigma)$. However, although we keep the same name as in Section 4.4 for simplicity, we point out that it is unknown whether this heuristic is a 3-approximation algorithm for arbitrary $k$.

PACK-BY-PACK ($\varepsilon$). The rationale for this heuristic is to create packs that are well-balanced: the difference between the smallest and longest execution times in each pack should be as small as possible. Initially, we assign one processor per task (for $1 \le i \le n$, $\sigma(i) = 1$), and tasks are sorted into a list $L$ ordered by non-increasing execution times ($t_{i,\sigma(i)}$ values). While there remain some tasks in $L$, let $T_{i^*}$ be the first task of the list, and let $t_{\max} = t_{i^*,\sigma(i^*)}$. Let $V_{req}$ be the ordered set of tasks $T_i$ such that $t_{i,\sigma(i)} \ge (1 - \varepsilon)\,t_{\max}$: this is the sublist of tasks (including $T_{i^*}$ as its first element) whose execution times are close to the longest execution time $t_{\max}$, and $\varepsilon \in [0, 1]$ is some parameter. Let $p_{req}$ be the total number of processors requested by tasks in $V_{req}$. If $p_{req} \ge p$, a new pack is created greedily with the first tasks of $V_{req}$, adding them into the pack while there are no more than $p$ processors used and no more than $k$ tasks in the pack. The corresponding tasks are removed from the list $L$. Note that $T_{i^*}$ is always inserted in the created pack. Also, if we have $\sigma(i^*) = p$, then a new pack with only $T_{i^*}$ is created. Otherwise ($p_{req} < p$), an additional processor is assigned to the (currently) critical task $T_{i^*}$, hence $\sigma(i^*) := \sigma(i^*) + 1$, and the process iterates after the list $L$ is updated with the insertion of the new value for $T_{i^*}$. Finally, once all packs are created, we apply Algorithm 1 in each pack, so as to derive the optimal schedule within each pack.
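The pack-creation loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: `times[i][s-1]` is a hypothetical table giving the execution time of task $T_i$ on $s$ processors, the greedy fill stops at the first task of $V_{req}$ that does not fit (one reading of "the first tasks"), ties are broken by task index, and the final call to Algorithm 1 on each pack is omitted.

```python
def pack_by_pack(times, p, k, eps):
    """Return packs as lists of (task index, processors), per the description."""
    sigma = [1] * len(times)             # one processor per task initially
    remaining = set(range(len(times)))
    packs = []

    def t(i):
        return times[i][sigma[i] - 1]

    while remaining:
        # List L: remaining tasks by non-increasing execution time.
        L = sorted(remaining, key=lambda i: (-t(i), i))
        i_star, t_max = L[0], t(L[0])
        v_req = [i for i in L if t(i) >= (1 - eps) * t_max]
        p_req = sum(sigma[i] for i in v_req)
        if p_req >= p:
            pack, used = [], 0
            for i in v_req:              # greedy fill with the first tasks
                if used + sigma[i] > p or len(pack) == k:
                    break
                pack.append(i)
                used += sigma[i]
            packs.append([(i, sigma[i]) for i in pack])
            remaining -= set(pack)
        else:
            sigma[i_star] += 1           # speed up the critical task
    return packs

def schedule_cost(packs, times):
    """Sum over packs of the longest execution time in the pack."""
    return sum(max(times[i][s - 1] for i, s in pack) for pack in packs)
```

Since the parameter `eps` only affects which tasks enter $V_{req}$, the same function can be called for several values of $\varepsilon$ and the result with the smallest `schedule_cost` kept.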

We have $0 < \varepsilon < 1$. A small value of $\varepsilon$ will lead to balanced packs, but may end up with a single task with $p$ processors per pack. Conversely, a large value of $\varepsilon$ will create new packs more easily, i.e., with fewer processors per task. The idea is therefore to call the heuristic with different values of $\varepsilon$, and to select the solution that leads to the best execution time.

Summary of heuristics. We consider two variants of the random heuristics, either with one single run, or with 9 different runs, hence hoping to obtain a better solution, at the price of a slightly longer execution time. These heuristics are denoted respectively Random-Pack-1, Random-Pack-9, Random-Proc-1, Random-Proc-9. Similarly, for PACK-BY-PACK, we either use one single run with $\varepsilon = 0.5$ (PACK-BY-PACK-1), or 9 runs with $\varepsilon \in \{0.1, 0.2, \ldots, 0.9\}$ (PACK-BY-PACK-9). Of course, there is only one variant of PACK-APPROX, hence leading to seven heuristics.

Variants. We have investigated variants of PACK-BY-PACK, trying to make a better choice than the greedy choice to
create the packs, for instance using a dynamic programming algorithm to minimize processor idle times in the pack. 
However, there was very little improvement at the price of a much higher running time of the heuristics. Additionally, 
we tried to improve heuristics with up to 99 runs, both for the random ones and for PACK-BY-PACK, but here again, 
the gain in performance was negligible compared to the increase in running time. Therefore we present only results 
for these seven heuristics in the following. 

6 Experimental Results 

In this section, we study the performance of the seven heuristics on workloads of parallel tasks. First we describe 
the workloads, whose application execution times model profiles of parallel scientific codes. Then we present the 
measures used to evaluate the quality of the schedules, and finally we discuss the results. 

Workloads. Workload-I corresponds to 10 parallel scientific applications that involve VASP [14], ABAQUS [3], LAMMPS [17] and PETSc [1]. The execution times of these applications were observed on a cluster with Intel Nehalem 8-core nodes connected by a QDR InfiniBand network with a total of 128 cores. In other words, we have $p = 16$ processors, and each processor is a multicore node.

Workload-II is a synthetic test suite that was designed to represent a larger set of scientific applications. It models tasks whose parallel execution time for a fixed problem size $m$ on $q$ cores is of the form $t(m, q) = f \times t(m, 1) + (1 - f)\,\frac{t(m, 1)}{q} + \kappa(m, q)$, where $f$ can be interpreted as the inherently serial fraction, and $\kappa$ represents overheads related to synchronization and the communication of data. We consider tasks with sequential times $t(m, 1)$ of the form $cm$, $cm \log_2 m$, $cm^2$ and $cm^3$, where $c$ is a suitable constant. We consider values of $f$ in $\{0, 0.04, 0.08, 0.16, 0.32\}$, with overheads $\kappa(m, q)$ of the form $\log_2 q$, $(\log_2 q)^2$, $q \log_2 q$, $\frac{m}{q} \log_2 q$, $\sqrt{m/q}$, and $m \log_2 q$ to create a workload with 65 tasks executing on up to 128 cores.
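As an illustration of this model, the sketch below evaluates $t(m, q)$ for one task. The values of $m$, $c$ and $f$ and the choice of the $\log_2 q$ overhead are arbitrary picks for the example, not parameters reported for any specific task of the workload.

```python
import math

def exec_time(t_seq, f, q, overhead):
    """t(m, q) = f * t(m, 1) + (1 - f) * t(m, 1) / q + kappa(m, q)."""
    return f * t_seq + (1.0 - f) * t_seq / q + overhead

m, c = 1024, 1.0                         # hypothetical problem size and constant
t_seq = c * m * math.log2(m)             # sequential time of the form c*m*log2(m)
# Execution times on 1, 2, 4 and 8 cores with f = 0.04 and overhead log2(q).
profile = [exec_time(t_seq, 0.04, q, math.log2(q)) for q in (1, 2, 4, 8)]
```

With these values the serial part contributes $0.04 \times 10240 = 409.6$ at every core count, so the time on 8 cores is $409.6 + 1228.8 + 3 = 1641.4$.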

The same process was also used to develop Workload-III, our largest synthetic test suite with 260 tasks for 256 cores (and $p = 32$ multicore nodes), to study the scalability of our heuristics. For all workloads, we modified speedup profiles to satisfy Equations (1) and (2).
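The text does not say how the profiles were modified. One possible way, shown purely as an assumption, is to clamp each entry so that the execution time never increases with the number of processors and the work (time multiplied by processors) never decreases, which is how Equations (1) and (2) are used in the proofs of Section 4.

```python
def clamp_profile(times):
    """times[s-1] = execution time on s processors; returns an adjusted copy."""
    fixed = [times[0]]
    for s in range(2, len(times) + 1):
        prev = fixed[-1]
        lower = prev * (s - 1) / s       # keeps s * t_s >= (s - 1) * t_{s-1}
        # Clamp into [lower, prev]: time is non-increasing, work non-decreasing.
        fixed.append(min(max(times[s - 1], lower), prev))
    return fixed
```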

As discussed in related work (see Section 2 and ETl), and confirmed by power measurement using Watts Up Pro meters, we observed only minor power consumption variations of less than 5% when we limited co-scheduling to occur across multicore nodes. Therefore, we only test configurations where no more than a single application can be scheduled on a given multicore node comprising 8 cores. Adding a processor to an application $T_i$ which is already assigned $\sigma_i$ processors actually means adding 8 new cores (a full multicore node) to the $8\sigma_i$ existing cores. Hence a pack size of $k$ corresponds to the use of at most $8k$ cores for applications in each pack. For Workloads-I and II, there are 16 nodes and 128 cores, while Workload-III has up to 32 nodes and 256 cores.

Methodology for assessing the heuristics. To evaluate the quality of the schedules generated by our heuristics, we
consider three measures: Relative cost, Packing ratio, and Relative response time. Recall that the cost of a pack is the 
maximum execution time of a task in that pack and the cost of a co-schedule is the sum of the costs over all its packs. 

We define the relative cost as the cost of a given co-schedule divided by the cost of a 1-pack schedule, i.e., one 
with each task running at maximum speed on all p processors. 

For a given k-IN-p-COSCHEDULE, consider $\sum_{i=1}^{n} t_{i,\sigma(i)} \times \sigma(i)$, i.e., the total work performed in the co-schedule when the $i$-th task is assigned $\sigma(i)$ processors. We define the packing ratio as this sum divided by $p$ times the cost of the co-schedule; observe that the packing quality is high when this ratio is close to 1, meaning that there is almost no idle time in the schedule.
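The relative cost and the packing ratio can be computed as in the sketch below. The data layout is an assumption made for illustration: a co-schedule is a list of packs, each pack a list of `(task index, processors)` pairs, and `times[i][s-1]` is the execution time of task $i$ on $s$ processors.

```python
def schedule_cost(packs, times):
    """Cost of a co-schedule: sum over packs of the longest task in the pack."""
    return sum(max(times[i][s - 1] for i, s in pack) for pack in packs)

def relative_cost(packs, times, p):
    """Cost divided by the cost of the 1-pack schedule (each task alone on p)."""
    one_pack_cost = sum(task[p - 1] for task in times)
    return schedule_cost(packs, times) / one_pack_cost

def packing_ratio(packs, times, p):
    """Total work divided by p times the cost of the co-schedule."""
    work = sum(times[i][s - 1] * s for pack in packs for i, s in pack)
    return work / (p * schedule_cost(packs, times))
```

For example, with three tasks on $p = 4$ processors, two of them sharing a pack with 2 processors each (times 5 and 4) and the third alone on 4 processors (time 1), the cost is $5 + 1 = 6$ and the packing ratio is $(10 + 8 + 4)/(4 \times 6) = 22/24$.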

An individual user could be concerned about an increase in response time and a corresponding degradation of individual productivity. To assess the impact on response time, we consider the performance with respect to a relative response time measure defined as follows. We consider a 1-pack schedule with the $n$ tasks sorted in non-decreasing order of execution time, i.e., in a "shortest task first" order, to yield a minimal value of the response time. If this ordering is given by the permutation $\pi(i)$, $i = 1, 2, \ldots, n$, the response time of task $i$ is $r_i = \sum_{j=1}^{i} t_{\pi(j),p}$ and the mean response time is $R = \frac{1}{n} \sum_{i=1}^{n} r_i$. For a given k-IN-p-COSCHEDULE with $u$ packs scheduled in increasing order of the costs of a pack, the response time of task $i$ in pack $v$, $1 \le v \le u$, assigned to $\sigma(i)$ processors, is: $\hat{r}_i = \sum_{\ell=1}^{v-1} cost(\ell) + t_{i,\sigma(i)}$, where $cost(\ell)$ is the cost of the $\ell$-th pack for $1 \le \ell \le u$. The mean response time $\hat{R}$ of the k-IN-p-COSCHEDULE is calculated using these values and we use $\hat{R}/R$ as the relative response time.
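Both mean response times can be computed as in the following sketch, which assumes the same hypothetical layout as before: packs are lists of `(task index, processors)` pairs and `times[i][s-1]` is the execution time of task $i$ on $s$ processors.

```python
def mean_response_one_pack(times, p):
    """Mean response time of the 1-pack schedule, shortest task first."""
    finish, total = 0.0, 0.0
    for t in sorted(task[p - 1] for task in times):
        finish += t                      # tasks run one after the other on p
        total += finish
    return total / len(times)

def mean_response_coschedule(packs, times):
    """Mean response time with packs run in increasing order of cost."""
    def pack_cost(pack):
        return max(times[i][s - 1] for i, s in pack)

    start, total, n = 0.0, 0.0, 0
    for pack in sorted(packs, key=pack_cost):
        for i, s in pack:
            total += start + times[i][s - 1]   # earlier packs, then own time
            n += 1
        start += pack_cost(pack)
    return total / n
```

The relative response time is then the ratio `mean_response_coschedule(packs, times) / mean_response_one_pack(times, p)`; values below 1 indicate an improvement.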

Results for small and medium workloads. For Workload-I, we consider packs of size $k = 2, 4, 6, 8, 10$ with 16 processors (hence a total of 128 cores). Note that we do not try $k = p = 16$ since there are only 10 applications in this workload. For Workload-II, we consider packs of size $k = 2, 4, 6, 8, 10, 12, 14, 16$.

Figure 2 shows the relative cost of co-schedules computed by the heuristics. For Workload-I (Figure 2(a)), the optimal co-schedule was constructed using exhaustive search. We observe that the optimal co-schedule has costs that are more than 35% smaller than the cost of a 1-pack schedule for Workload-I. Additionally, we observe that PACK-APPROX and PACK-BY-PACK compute co-schedules that are very close to the optimal one for all values of the pack size. Both Random-Pack and Random-Proc perform poorly when compared to PACK-BY-PACK and PACK-APPROX, especially when a single run is performed. As expected, Random-Proc does better than Random-Pack because it benefits from the use of Algorithm 2, and for this small workload, Random-Proc-9 almost always succeeds in finding a near-optimal co-schedule. The results are similar for the larger Workload-II as shown in Figure 2(b), with an increased gap between random heuristics and the packing ones. Computing the optimal co-schedule was not feasible for Workload-II because of the exponential growth in running times for exhaustive search. With respect to the cost of a 1-pack schedule, we observe very significant benefits, with a reduction in costs of more than 80% for larger values of the pack size, and in particular in the unconstrained case where $k = p = 16$. This corresponds to significant savings in energy consumed by the hardware for servicing a specific workload.

Figure 3 shows the quality of packing achieved by the heuristics. The packing ratios are very close to one for PACK-BY-PACK and PACK-APPROX, indicating that our methods are producing high quality packings. In most cases, Random-Proc and Random-Pack also lead to high packing ratios.

Finally, Figure 4 shows that PACK-BY-PACK and PACK-APPROX produce lower cost schedules with commensurate reductions in response times. For Workload-II and larger values of the pack size, response time gains are over 80%, making k-IN-p-COSCHEDULE attractive from the user perspective.

Scalability. Figure 5 shows scalability trends for Workload-III with 260 tasks on 32 processors (hence a total of 256 cores). Although all heuristics, including Random-Pack and Random-Proc, reduce costs relative to those for a 1-pack schedule, PACK-APPROX and PACK-BY-PACK are clearly superior, even when the random schemes are run 9 times. We observe that for pack sizes of 16 and 32, PACK-APPROX and PACK-BY-PACK produce high quality co-schedules with costs and response times that are respectively 90% and 80% lower than those for a 1-pack schedule. PACK-BY-PACK-1 obtains results that are very close to those of PACK-BY-PACK-9, hence even a single run returns a high quality co-schedule.

Running times. We report in Table 1 the running times of the seven heuristics. All heuristics run in less than 100 milliseconds, even for the largest workload. Note that PACK-APPROX was faster on Workload-II than Workload-I because its execution performed fewer iterations in this case. Random heuristics are slower than the other heuristics, because of the cost of random number generation. PACK-BY-PACK has comparable running times with PACK-APPROX, even when 9 values of $\varepsilon$ are used.

Figure 2: Quality of co-schedules: Relative costs are shown in (a) for Workload-I and in (b) for Workload-II. The horizontal line in (a) indicates the relative cost of an optimal co-schedule for Workload-I.

Figure 3: Quality of packs: Packing ratios are shown in (a) for Workload-I and in (b) for Workload-II. The horizontal line in (a) indicates the packing ratio of an optimal co-schedule for Workload-I.





Heuristic         Workload-I   Workload-II   Workload-III
PACK-APPROX             0.50          0.30           5.12
PACK-BY-PACK-1          0.03          0.12           0.53
PACK-BY-PACK-9          0.30          1.17           5.07
Random-Pack-1           0.07          0.34           9.30
Random-Pack-9           0.67          2.71          87.25
Random-Proc-1           0.05          0.26           4.49
Random-Proc-9           0.47          2.26          39.54

Table 1: Average running times in milliseconds.

Summary of experimental results. Results indicate that heuristics PACK-APPROX and PACK-BY-PACK both produce co-schedules of comparable quality. PACK-BY-PACK-9 is slightly better than PACK-BY-PACK-1, at the price of an increase in the running time from using more values of $\varepsilon$. However, the running time remains very small, and similar to that of PACK-APPROX. Using more values of $\varepsilon$ to improve PACK-BY-PACK leads to small gains in performance (e.g., 1% gain for PACK-BY-PACK-9 compared to PACK-BY-PACK-1 for $k = 16$ in Workload-II). However, these small gains in performance correspond to significant gains in system throughput and energy, and far outweigh the costs of computing multiple co-schedules. This makes PACK-BY-PACK-9 the heuristic of choice. Our experiments with 99 values of $\varepsilon$ did not improve performance, indicating that large increases in the number of $\varepsilon$ values may not be necessary.

7 Conclusion 

We have developed and analyzed co-scheduling algorithms for processing a workload of parallel tasks. Tasks are 
assigned to processors and are partitioned into packs of size k with the constraint that the total number of processors 
assigned over all tasks in a pack does not exceed p, the maximum number of available processors. Tasks in each pack 
execute concurrently on a number of processors, and the workload completes in time equal to the sum of the execution times of the packs. We have provided complexity results for minimizing the sum of the execution times of the packs. The
bad news is that this optimization problem is NP-complete. This does not come as a surprise because we have to 
choose for each task both a number of processors and a pack, and this double freedom induces a huge combinatorial 
solution space. The good news is that we have provided an optimal resource allocation strategy once the packs are 
formed, together with an efficient load-balancing algorithm to partition tasks with pre-assigned resources into packs. 
This load-balancing algorithm is proven to be a 3-approximation algorithm for the most general instance of the problem. Building upon these positive results, we have developed several heuristics that exhibit very good performance in our test sets. These heuristics can significantly reduce the time for completion of a workload, with corresponding savings in system energy costs. Additionally, these savings come along with measurable benefits in the average response time for task completion, thus making co-scheduling attractive from the user's viewpoint.

These co-schedules can be computed very rapidly when speed-up profile data are available. Additionally, they operate 
at the scale of workloads with a few to several hundred applications to deliver significant gains in energy and time 
per workload. These properties present opportunities for developing hybrid approaches that can additionally leverage 
dynamic voltage and frequency scaling (DVFS) within an application. For example, Rountree et al. [19] have shown that depending on the properties of the application, DVFS can be applied at runtime through their Adagio system, to yield system energy savings of 5% to 20%. A potential hybrid scheme could start with the computation of a k-IN-p-COSCHEDULE for a workload, following which DVFS could be applied at runtime per application.

Our work indicates the potential benefits of co-schedules for high performance computing installations where even medium-scale facilities consume megawatts of power. We plan to further test and extend this approach towards deployment in university scale computing facilities where workload attributes often do not vary much over weeks to months and energy costs can be a limiting factor.

Figure 4: Relative response times are shown in (a) for Workload-I and in (b) for Workload-II; values less than 1 indicate improvements in response times. The horizontal line in (a) indicates the relative response time of an optimal co-schedule for Workload-I.

Figure 5: Relative costs, packing ratios and relative response times of co-schedules for Workload-III on 256 cores.